
Analysis of Stock Data

Final Project of the Udacity Nanodegree Data Scientist Program

Motivation & Outline

For the final project of the Udacity Nanodegree Data Scientist Program, I had to decide what kind of data to work on while applying the acquired skills.

Since I did not want to put too much effort into analyzing a data set that I would have no use for after completing the Nanodegree (whether at home or at work), I chose financial data. A year ago, after my first Udacity course (Data Analyst), I wrote a program that scrapes, transforms, visualizes and stores my personal financial data from PDF documents (statements of earnings, bank account statements and insurance data) in order to monitor developments over time, identify trends and keep track of the variety of information. For obvious confidentiality reasons, those are not going to be part of the current project.

Around the same time, I started investing in stocks, funds and ETFs for the first time in my life (which seemingly everybody did during the Covid-19 pandemic). Well aware that I am not going to be the next Warren Buffett, I still enjoy studying a field that I had not entered before and that will surely affect me for the rest of my life. This is exactly why I want to use financial stock data for this data science project with Python. So far, time series data has not been the focus of the classes, so I additionally needed to do some research (among others with the help of the free Udacity courses "Time Series Forecasting" and "Machine Learning for Trading").

The project requirements/steps are the following:

  • use of a github repository
  • problem definition
  • data investigation and processing (including machine learning)
  • presentation/visualization of the results
  • blog post or web app deployment

Project Idea/Plan:

  • create a pipeline gathering stock data, calculating features and saving on personal NAS
  • use machine learning to estimate performance, trend and eventually prices
  • create automated daily report with visualizations to be aware of changes in stocks
  • automatically send the report and/or special info/warnings via mail
  • publish the newest report and non-confidential insights on a personal web app (created with Flask/Bootstrap and deployed on my personal NAS server)
  • create Python module for standard plots and calculations
  • use git for version control on remote repository on personal NAS (and share on github)
  • write and publish an article about the project (homepage and/or Medium)

Disclaimer

Please note that all insights, data, findings and predictions that I make throughout this project shall not be used as a basis for trading stocks in any way! There may be mistakes, incomplete analyses and biased conclusions. No guarantees!

Problem Definition

As mentioned before, I currently hold stocks and ETFs and plan on eventually acquiring more in the future (without risking much). That being said, the following questions come up:

  1. How well are the stocks performing compared to the past and/or to each other?
    • How or with which features can I quantify the performance?
  2. How well will the stocks perform in the near future and can their performance be predicted?
    • Which indicators can foreshadow future price movements and where are the limits of ML algorithms?

All of the above feeds into a further question: "In which stocks or markets should I invest in the future?"

Gathering and Wrangling the Stock Data

The majority of the stated questions/tasks can be assessed exemplarily with the historical data and metadata of a single stock. However, I plan on working with functions so that the upcoming algorithms can be applied to almost any stock or market.

Being aware that there have already been masses of similar projects and research, I will try to use as many useful existing APIs, modules and strategies as possible to get everything to work and save brain capacity (feel free to check out the credits and sources at the end of the notebook).

I tried three common (and mostly free) Python APIs for gathering historical stock data. Quandl, for example, seems to be well known, but I had some issues navigating its databases and finding the stock data that I wanted. YFinance (Yahoo Finance) comes in quite handy, but I nevertheless decided to use Alpha Vantage (limited to 5 requests per minute and 500 per day), which delivers a lot of data in a comfortable way. Take into account that you have to get a free API key first (the same applies to Quandl).

About the time intervals: having a regular job, I can't react to price movements within hours or even minutes, which is why I am totally fine with historical data on a daily basis.
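Since the free Alpha Vantage tier allows only 5 requests per minute, any batch download needs pacing. A minimal sketch of such a throttle (the helper name and symbol list are illustrative, not part of the project code):

```python
import time

def paced(symbols, max_per_minute=5):
    """Yield ticker symbols, sleeping between items so that at most
    `max_per_minute` requests are issued per minute."""
    interval = 60.0 / max_per_minute
    for i, symbol in enumerate(symbols):
        if i:  # no need to wait before the first request
            time.sleep(interval)
        yield symbol

# usage: for symbol in paced(["BMW.DE", "AAPL", "MSFT"]): fetch(symbol)
```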

In [20]:
# imports
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.io as pio
import plotly as py
import datetime
import random
import re
import requests
import time
import yfinance as yf
import ta
import pandas_ta
import smtplib
import json
import tensorflow as tf
from io import BytesIO
from functools import reduce
from pandas.tseries.offsets import DateOffset
from chart_studio import plotly
from plotly.offline import iplot
from plotly.subplots import make_subplots
from email.message import EmailMessage
from keras.preprocessing.sequence import TimeseriesGenerator
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from scipy import signal
from IPython.core.display import display, HTML
from tqdm.notebook import tqdm
from alpha_vantage.timeseries import TimeSeries
from alpha_vantage.fundamentaldata import FundamentalData
from alpha_vantage.techindicators import TechIndicators
from alpha_vantage.sectorperformance import SectorPerformances
from scipy.signal import argrelextrema
from statsmodels.tsa.seasonal import seasonal_decompose

%matplotlib inline

#widen notebook
display(HTML("<style>.container { width:90% !important; }</style>"))  # increase display width of notebook
# enable html export with working plotly plots?!
pio.renderers.default = "notebook"

# set displaying options for pandas and matplotlib
pd.set_option("display.float_format", lambda x: "%.2f" % x)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', 100)
# pd.set_option("display.max_rows", None, "display.max_columns", None)
plt.rcParams['figure.figsize'] = [8, 6]
plt.rcParams['figure.dpi'] = 100 # 200 e.g. is really fine, but slower
#jtplot.style(theme="grade3", context="notebook", ticks=True, grid=False)

Available data

As described above, there are several (free) sources of stock data, and I will mostly use the Alpha Vantage API, which theoretically provides the following data:

  • Historic Stock Data (OHLC, volume, splits, dividends)
  • Technical Indicators (e.g. SMA, EMA, RSI, etc.; these can mostly be calculated later with existing Python modules, so there is no need to stress the API request limit)
  • Fundamental Data
    • Calendar Earnings (information about the statements of earnings / dates of quarterly reports)
    • Company Overview (general up to date information about the company such as e.g. P/E ratio and more)
    • Sector Performance (information about the current performance of the different sectors/industries)

Unfortunately, there are still lots of stocks for which not all data is available or up to date (especially for stocks outside the US markets), but at least the historical data is usually provided. The following code contains my functions to quickly gather the described data if available. For the project review I will provide the data as CSV files, since I am not going to hand over my API keys.
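As an aside to the point above that indicators can mostly be calculated locally: here is a hedged sketch of the RSI with plain pandas (Wilder's smoothing approximated by an exponential moving average; `rsi_local` is my own helper name, not project code):

```python
import pandas as pd

def rsi_local(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index computed from a close-price series."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / period, min_periods=period).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / period, min_periods=period).mean()
    return 100 - 100 / (1 + gain / loss)

# a strictly rising price series is "maximally overbought" -> RSI of 100
print(rsi_local(pd.Series(range(1, 31), dtype=float)).iloc[-1])  # 100.0
```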

In [21]:
# choose a stock to work on (default = BMW.DE, other suggestions: AAPL, MSFT, GOOG)
# symbol = str(input("Type the symbol of the stock that you want to analyze: " or "BMW.DE"))
symbol="BMW.DE"

For now the BMW stock will serve as an example, even though the code is intended to work for all stocks that are available at Alpha Vantage.

In [22]:
# get historical stock data from alpha vantage
def get_stockHistory(symbol):
    """
    input:
        symbol (str): String containing the exact ticker symbol of the stock of interest
    output:
        df (DataFrame): dataframe containing the stocks history data (open, high, low, close, close adj, volume, div, split)
        meta_data (dict): dictionary containing meta data about the stock of interest
    """
    # read api key from .txt
    with open('api_keys/alpha_vantage.txt') as f:
        api_key = f.read().strip()  # strip a possible trailing newline
        
    # make alpha_vantage api requests
    ts = TimeSeries(key=api_key, output_format="pandas")
    
    df, meta_data = ts.get_daily_adjusted(symbol=symbol, outputsize="full")  # get stock history data
    df.columns = ["open", "high", "low", "close", "close_adj", "volume", "div", "split"]  # rename columns
    
    return df, meta_data
In [23]:
# in case of missing data or limited requests with Alpha Vantage, I'll also provide a function to request the data from yfinance
def get_stockHistory_YF(symbol):
    """
    input:
        symbol (str): String containing the exact ticker symbol of the stock of interest
    output:
        df (DataFrame): dataframe containing the stocks history data (open, high, low, close, close adj, volume, div, split)
    """
    obj = yf.Ticker(symbol)
    df = obj.history(period="max")
    # valid periods: 1d,5d,1mo,3mo,6mo,1y,2y,5y,10y,ytd,max
    # rename columns
    df.columns = ["open", "high", "low", "close", "volume", "div", "split"]
    # rename index
    df.index.names = ['date']
    
    return df
In [24]:
# get company data from alpha vantage
def get_company_overview(symbol):
    """
    input:
        symbol (str): String containing the exact ticker symbol of the stock of interest
    output:
        df (DataFrame): dataframe containing the stocks company data
    """
    # read api key from .txt
    with open('api_keys/alpha_vantage.txt') as f:
        api_key = f.read().strip()  # strip a possible trailing newline
    # make alpha_vantage api request
    fd = FundamentalData(key=api_key, output_format="pandas")
    df = fd.get_company_overview(symbol=symbol)
    df = df[0]
    
    return df
In [25]:
# show fundamental stock data if available
try:
    stockCompOver = get_company_overview(symbol)
    display(stockCompOver.T)
except Exception as e:
    print("Error for symbol '{}': {}".format(symbol, e))
Error for symbol 'BMW.DE': Error getting data from the api, no return was given.

Unfortunately there is no company overview data for the symbol "BMW.DE"

In [26]:
# get dates of quarterly reports from alpha vantage if available
def get_earnings_calendar(horizon, symbol):
    """
    input:
        horizon (str): time period over which the dates of the company's earnings communications / quarterly reports are published. Either "3month", "6month" or "12month".
        symbol (str): String containing the exact ticker symbol of the stock of interest
    output:
        df (DataFrame): dataframe containing the dates of the company's earnings communications / quarterly reports
    """
    BASE_URL = r"https://www.alphavantage.co/query?"
    
    with open('api_keys/alpha_vantage.txt') as f:
        api_key = f.read().strip()  # strip a possible trailing newline
        
    url = f'{BASE_URL}function=EARNINGS_CALENDAR&symbol={symbol}&horizon={horizon}&apikey={api_key}'
    response = requests.get(url)
    df = pd.read_csv(BytesIO(response.content))

    return df
In [27]:
try:
    stockEarnCal = get_earnings_calendar("12month", symbol)
except Exception as e:
    print("Error for symbol '{}': {}".format(symbol, e))
# show dates of quarterly earnings if available
stockEarnCal
Out[27]:
symbol name reportDate fiscalDateEnding estimate currency

Unfortunately there is no earnings calendar data for the symbol "BMW.DE"

In [28]:
# combined api request that also outputs technical indicators and, if available, fundamental data
def get_stockData(symbol, hist=True, techInd=False, earnCal=False, compOver=False, secPerf=False):
    """
    input:
        symbol (str): String containing the exact ticker symbol of the stock of interest
        hist, techInd, earnCal, compOver, secPerf (bool): get historic/technical indicator/earnings calendar/company overview/sector performance data if True
    output:
        df_x (DataFrame): dataframes containing the stocks data asked for with the input labels
    """
    # create dataframe shells
    df_ts = pd.DataFrame()
    df_ti = pd.DataFrame()
    df_sma = pd.DataFrame()
    df_ema = pd.DataFrame()
    df_rsi = pd.DataFrame()
    df_adx = pd.DataFrame()
    df_mom = pd.DataFrame()
    df_bb = pd.DataFrame()
    df_ec = pd.DataFrame()
    df_co = pd.DataFrame()
    df_sp = pd.DataFrame()
    
    # read api key from .txt
    with open('api_keys/alpha_vantage.txt') as f:
        api_key = f.read().strip()  # strip a possible trailing newline
        
    # make alpha_vantage api requests
    ts = TimeSeries(key=api_key, output_format="pandas")
    ti = TechIndicators(key=api_key, output_format="pandas")
    fd = FundamentalData(key=api_key, output_format="pandas")
    sp = SectorPerformances(key=api_key, output_format="pandas")
    
    # historical data
    if hist:
        df_ts, _ = ts.get_daily_adjusted(symbol=symbol, outputsize="full")  # get stock history data
        df_ts.columns = ["open", "high", "low", "close", "close_adj", "volume", "div", "split"]  # rename columns
    
    # technical indicators
    if techInd:
        df_sma, _ = ti.get_sma(symbol=symbol, interval='daily', time_period=60, series_type="close")    # get sma
        df_ema, _ = ti.get_ema(symbol=symbol, interval='daily', time_period=60, series_type="close")    # get ema
        df_rsi, _ = ti.get_rsi(symbol=symbol, interval='daily', time_period=60, series_type="close")    # get rsi
        # df_adx, _ = ti.get_adx(symbol=symbol, interval='daily', time_period=60)                       # get adx (ignored in order not to exceed 5 requests per minute)
        # df_mom, _ = ti.get_mom(symbol=symbol, interval='daily', time_period=60, series_type="close")  # get mom (ignored in order not to exceed 5 requests per minute)
        df_bb, _ = ti.get_bbands(symbol=symbol, interval='daily', time_period=60, series_type="close")  # get bbands
        df_bb.columns = ["BBmi", "BBlo", "BBup"]
    
    # earnings calendar
    if earnCal:
        df_ec = get_earnings_calendar("12month", symbol)  # reuse the requests-based helper defined above

    # company overview
    if compOver:
        df_co = fd.get_company_overview(symbol=symbol)
        df_co = df_co[0]

    # sector performance info
    if secPerf:
        df_sp, _ = sp.get_sector()  # get sector
    
    # merge historical data with indicators
    dfs = [df_ts, df_sma, df_ema, df_rsi, df_adx, df_mom, df_bb] 
    df_comp = reduce(lambda left, right: pd.merge(left, right, how="outer", left_index=True, right_index=True), dfs)
    
    return df_comp, df_ts, df_ec, df_co, df_sp
In [29]:
# stockHist_comp, stockHist, _, _, _ = get_stockData(symbol, hist=True, techInd=False, earnCal=False, compOver=False, secPerf=False)
# stockHist_comp.head(3)
In [30]:
# creating csv containing stock history and txt containing meta data for testing without api requests
# stockHist.to_csv("data/datasets/stockHist.csv")
# json.dump(stockMeta, open("data/datasets/stockMeta.txt",'w'))
In [31]:
# reading stock history from csv (for other users without api keys)
stockHist = pd.read_csv("data/datasets/stockHist.csv", index_col="date", parse_dates=True)

# # reading meta data from txt
# stockMeta = json.load(open("data/datasets/stockMeta.txt"))

stockHist
Out[31]:
open high low close close_adj volume div split
date
2021-12-16 89.75 90.31 89.27 89.64 89.64 1364574.00 0.00 1.00
2021-12-15 88.89 89.41 88.23 88.27 88.27 794212.00 0.00 1.00
2021-12-14 89.98 90.04 88.22 88.40 88.40 1116145.00 0.00 1.00
2021-12-13 89.80 91.88 89.55 89.88 89.88 1086537.00 0.00 1.00
2021-12-10 89.53 90.19 88.95 89.66 89.66 1415043.00 0.00 1.00
... ... ... ... ... ... ... ... ...
2005-01-07 34.69 34.73 34.31 34.60 19.82 1864405.00 0.00 1.00
2005-01-06 34.44 34.91 34.43 34.71 19.88 2130931.00 0.00 1.00
2005-01-05 34.22 34.69 34.05 34.54 19.78 3314502.00 0.00 1.00
2005-01-04 33.60 34.52 33.60 34.42 19.72 3613994.00 0.00 1.00
2005-01-03 33.41 33.85 33.40 33.75 19.33 1742708.00 0.00 1.00

4306 rows × 8 columns

In [32]:
stockHist.describe()
Out[32]:
open high low close close_adj volume div split
count 4306.00 4306.00 4306.00 4306.00 4306.00 4306.00 4306.00 4306.00
mean 62.41 63.11 61.64 62.39 48.25 2427249.85 0.01 1.00
std 22.29 22.43 22.09 22.27 21.82 1488484.45 0.15 0.00
min 17.28 17.82 16.00 17.04 10.98 0.00 0.00 1.00
25% 41.46 41.95 40.85 41.40 25.38 1462853.75 0.00 1.00
50% 64.69 65.27 63.95 64.63 50.59 2063199.00 0.00 1.00
75% 81.75 82.54 80.74 81.60 67.05 2895719.25 0.00 1.00
max 123.30 123.75 120.35 122.60 95.89 17588760.00 4.00 1.00
In [33]:
stockHist.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4306 entries, 2021-12-16 to 2005-01-03
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   open       4306 non-null   float64
 1   high       4306 non-null   float64
 2   low        4306 non-null   float64
 3   close      4306 non-null   float64
 4   close_adj  4306 non-null   float64
 5   volume     4306 non-null   float64
 6   div        4306 non-null   float64
 7   split      4306 non-null   float64
dtypes: float64(8)
memory usage: 302.8 KB

Fortunately, by using historical stock data from the Alpha Vantage API, not much data wrangling is necessary. There would be a lot more work if the plan were to scrape fundamental data for each time step in the past in order to improve the analysis. That won't be part of this notebook, though.

Difference between Close Values and adjusted Close Values

When I started comparing the stock data from Alpha Vantage to the charts that I found on various broker websites, I was irritated by differences in value, which I later found out result from the distinction between adjusted and non-adjusted close values. The adjusted close values are calculated with respect to dividends, splits and new offerings. Since the historical OHLC (Open, High, Low, Close) data relates to the non-adjusted close values, I will use the adjusted value only when there is no interaction with the open, high, low or volume data. For stocks with splits in their history, I might need a more detailed approach to avoid bias, for example in a machine learning model. By the way: the YFinance API doesn't return adjusted close values by default, which is one of my reasons to use Alpha Vantage where possible.
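To make the adjustment concrete, here is a toy sketch with hypothetical prices (not BMW data): a 2:1 split before day 3 and a 1.00 dividend on day 4, adjusted backwards in time so that earlier closes stay comparable with today's price level:

```python
import pandas as pd

# hypothetical 4-day history, ascending dates
df = pd.DataFrame({
    "close": [100.0, 102.0, 51.0, 50.0],
    "split": [1.0, 1.0, 2.0, 1.0],   # split ratio effective that day
    "div":   [0.0, 0.0, 0.0, 1.0],   # cash dividend paid that day
})

# per-day adjustment factor: a split scales earlier prices by 1/ratio, a
# dividend on day t scales earlier closes by (prev_close - div) / prev_close
factor = (1.0 / df["split"]) * (1.0 - df["div"] / df["close"].shift(1)).fillna(1.0)

# the adjusted close multiplies together the factors of all *later* days
df["close_adj"] = df["close"] * factor[::-1].cumprod()[::-1].shift(-1, fill_value=1.0)
print(df["close_adj"].round(2).tolist())  # [49.02, 50.0, 50.0, 50.0]
```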

In [34]:
# plot close and adjusted close prices of the entire stock history as well as split and dividends information
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.update_layout(xaxis_title="", yaxis_title="Price", template="plotly_white")
fig.update_yaxes(title_text="Dividends", secondary_y=True)
fig.add_trace(go.Scatter(x=stockHist.index, y=stockHist.close, mode="lines", name="close"))
fig.add_trace(go.Scatter(x=stockHist.index, y=stockHist.close_adj, mode="lines", name="close_adj"))
fig.add_trace(go.Bar(x=stockHist.index, y=stockHist["div"], name="dividends"), secondary_y=True)
fig.update_traces(marker_color='white', marker_line_color='darkgreen',
                  marker_line_width=2, width=1000 * 3600 * 24 * 31, opacity=0.6, secondary_y=True)
try:
    fig.add_vline(x=stockHist[stockHist.split!=1].split, line_color="orange", line_width=1, line_dash="dash")
except Exception as e:
    print("Error in Split-Data: "+str(e)+" -> Probably no splits found in historic data:")
    # check if there were any splits in the stocks history (values other than 1):
    print("Unique split values: ", stockHist.split.unique())
fig['layout']['yaxis2']['showgrid'] = False
fig.show()

fig.write_html("data/results/reports/close_plot.html")
Error in Split-Data: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). -> Probably no splits found in historic data:
Unique split values:  [1.]

Estimate Stock Performance/Trend/Price

Of course, it would be naive to think I could easily predict future stock movements with the basic knowledge that I have, but still... you have to start somewhere, right? Let's not rush things by trying to predict prices for the next month, but rather investigate the general ideas for understanding and estimating price behaviour, then put the criteria into some scalable measures and finally try to at least predict an up or down trend a few days into the future.

Usually, the analysis of a stock is divided into fundamental analysis and technical analysis. For a quick understanding of the underlying differences, I found the following page helpful:

Fundamental Analysis

Fundamental analysis can be understood as "looking at aspects of a company in order to estimate its value". This can be an analysis of the company's general condition, its decisions and communications, but it can also be the analysis of "outside influences" like the current pandemic or even tweets of famous people about the company or its market. Fundamental analysis can be a matter of politics, natural catastrophes and more, and is especially used for long-term investments. Since the focus of this project is rather on short-term decision making, I will not discuss fundamental analysis any further, but rather the more interesting analysis for a programmatic approach: technical analysis.

Technical Analysis

Technical analysis deals with the quantification of a company's stock performance, usually over a short- or mid-term timespan. There is a huge number of so-called indicators, calculated from the available stock data, that traders use to assess current price movements. Those indicators can also be visual patterns in financial charts like a candlestick diagram, or other techniques like seasonal decomposition (although I think this technique is preferably applied to less unsteady time series like yearly sales, etc.). Since this project is more about the programmatic approach and less about the financial background, I will introduce just a few indicators for basic use cases.

All upcoming features/indicators will be stored in a copy of the stockHist data frame called "stockHist_comp" (comp = "complex").

In [35]:
try:
    stockHist_comp  # might have been already created with the get_stockData function
except NameError:
    stockHist_comp = stockHist.copy()

Seasonality and Trend

With the help of the seasonal_decompose function of statsmodels, we can determine trend and seasonal patterns in the historical data. Assuming that we need a multiplicative decomposition, the components will look as in the following charts (the period is set to 252 working days, approximately one year of data). The trend, as its name indicates, shows the general movement of the stock price, whilst the seasonal component in the chosen time period shows a seasonal/cyclic pattern indicating price movements that recur each year. The residual component basically reflects uncertainty, stock volatility and unforeseeable changes in price (e.g. the Covid-19 impact at the beginning of 2020).

In [36]:
# use statsmodels seasonal_decompose to find seasonal/cyclic pattern in the historic data
decompose_result_mult = seasonal_decompose(stockHist.close_adj.iloc[::-1], model="multiplicative", period=252)  # , extrapolate_trend='freq'?

stockHist_comp["trend"] = decompose_result_mult.trend
stockHist_comp["seasonal"] = decompose_result_mult.seasonal
stockHist_comp["residual"] = decompose_result_mult.resid

fig = make_subplots(rows=2, cols=1, shared_xaxes=True, row_width=[0.5, 0.5])
fig.update_layout(title="Seasonal Decomposition", template="plotly_white")
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.close_adj, name="close_adj"), row=1, col=1)
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.trend, name="trend"), row=1, col=1)
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.seasonal, name="seasonal"), row=2, col=1)
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.residual, name="residual"), row=2, col=1)

fig.show()
fig.write_html("data/results/reports/seasonal_decomposition_plot.html")

If you hide the residual trace by clicking on it, you will see the seasonal pattern that seems to recur each year. This could be caused by the communication of quarterly reports (orange dashed lines in the next chart), seasonal client buying behaviour, the distribution of dividends (green dashed lines in the next chart) or the possibility to buy "preference shares" (for BMW at the beginning of November).

In [37]:
def compare_years(df, feature, num_years=3):
    """
    input:
        df (DataFrame): data frame containing the features that shall be compared for each year
        feature (string): name of the feature that shall be compared for each year
        num_years (int): number of years in the past that shall be compared
    output:
        fig (plotly figure object): saved as html
    """
    current_year = datetime.datetime.now().year
    years = []
    for i in range(num_years+1):
        years.append(current_year-i)
    years = years[::-1]
    
    plot_layout = go.Layout(title="Comparison of {} in the last years".format(feature))
    fig = go.Figure(layout=plot_layout)
    fig.update_layout(xaxis_title="Workday", yaxis_title=feature, template="plotly_white")
    
    quarters = [df.loc[str(years[-2])].shape[0]/4, df.loc[str(years[-2])].shape[0]/2, df.loc[str(years[-2])].shape[0]/4*3]
    for quarter in quarters: fig.add_vline(x=quarter, line_width=2, line_dash="dash", line_color="orange")
    
    for year in years:
        df_temp = df.loc[str(year)].sort_values(by="date", ascending=True)
        df_temp.reset_index(inplace=True)

        fig.add_traces(go.Scatter(x=df_temp.index, y=df_temp[feature]/df_temp[feature], mode="lines", 
                                  name="Horizontal", showlegend=False))
        fig.add_traces(go.Scatter(x=df_temp.index, y=df_temp[feature], mode='lines', 
                                  name=feature+" "+str(year), fill="tonexty"))
        for placeholder in df_temp.loc[df_temp["div"]!=0].index.values: 
            fig.add_vline(x=placeholder, line_width=2, line_dash="dash", line_color="lightgreen")
        
    fig.show()
    
    fig.write_html("data/results/reports/{}_plot.html".format(feature))
In [38]:
compare_years(stockHist_comp, "seasonal", num_years=4)
In [39]:
compare_years(stockHist_comp, "residual", num_years=4)

The comparison of the residuals over the years clearly shows the impact of the start of the pandemic in March 2020, when all markets decreased significantly.

In [40]:
def seasonal_forecast(df):
    """
    input:
        df (DataFrame): data frame containing the seasonal component features and close_adj values
    output:
        fig (plotly figure object): 2 diagrams that show the seasonal pattern and a possible future predicted with it
    """
    feature = "seasonal"
    current_year = datetime.datetime.now().year

    plot_layout = go.Layout(
            title="{} in the current year compared to the last year".format(feature)
            )
    fig = go.Figure(layout=plot_layout)
    fig.update_layout(xaxis_title="Workday", yaxis_title=feature, template="plotly_white")

    df_current_year = df.loc[str(current_year)].sort_values(by="date", ascending=True)
    df_current_year.reset_index(inplace=True)
    df_current_year["close_adj_mean"] = df_current_year.close_adj.mean()
    df_last_year = df.loc[str(current_year-1)].sort_values(by="date", ascending=True)
    df_last_year.reset_index(inplace=True)

    # try to find the x-offset to synchronize the seasonal pattern
    lag = []
    x = df_last_year[feature].to_numpy(na_value=0)
    y = df_current_year[feature].to_numpy(na_value=0)
    correlation = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(x.size, y.size, mode="full")
    lag = lags[np.argmax(correlation)]

    df_current_year = df_current_year.shift(lag)  # somehow the estimated lag doesn't always fit
    df_current_year.dropna(subset=["date"], inplace=True)

    fig.add_traces(go.Scatter(x=df_last_year.index, y=df_last_year[feature], mode='lines', name=feature+" last year", marker=dict(color="blue")))
    fig.add_traces(go.Scatter(x=df_current_year.index, y=df_current_year[feature], mode='lines', name=feature+" current year", marker=dict(color="red")))
    fig.show()
    fig.write_html("data/results/reports/seasonal_lag_plot.html")

    # predict future close_adj with seasonality (holding last price constant)
    future = df_last_year.loc[df_last_year.index > df_current_year.index.max(), ["seasonal"]]
    offset_correction = future.seasonal.values[0] * df_current_year.close_adj.values[-1] - df_current_year.close_adj.values[-1]
    future["close_adj"] = future.seasonal * df_current_year.close_adj.values[-1] - offset_correction # maybe better calculate with continued SMA?!
    future.index = pd.bdate_range(start=str(df_current_year.date.dropna().values[-1])[:10], end=str(datetime.datetime.now().year)+"-12-31")[:len(future)]

    fig = make_subplots(specs=[[{"secondary_y": True}]])
    fig.update_layout(
            title="Close_adj Prediction with seasonality holding everything constant",
            xaxis_title="Workday",
            yaxis_title="Close_adj", 
            template="plotly_white"
            )
    fig.add_trace(go.Scatter(x=df_current_year.date, y=df_current_year.close_adj, mode="lines", 
                             name="Past Close_adj", marker=dict(color="blue")))
    fig.add_trace(go.Scatter(x=future.index, y=future.close_adj, mode="lines", 
                             name="Future Close_adj", marker=dict(color="red")))
    fig.add_trace(go.Scatter(x=df_current_year.date, y=df_current_year.trend, mode="lines", 
                             name="Past Trend", marker=dict(color="black"), line_dash="dash"))
    fig.add_trace(go.Scatter(x=df_current_year.date, y=df_current_year.close_adj_mean, mode="lines", 
                             name="Close_adj Mean", marker=dict(color="lightblue"), line_dash="dash"), secondary_y=False)
    fig.add_trace(go.Scatter(x=df_current_year.date, y=df_current_year.close_adj.mean()*df_current_year.seasonal, 
                             mode="lines", name="Close_adj if trend was constant", fill="tonexty", marker=dict(color="lightblue")), secondary_y=False)
    fig.show()
    fig.write_html("data/results/reports/seasonal_predict_plot.html")
In [41]:
seasonal_forecast(stockHist_comp)

The light blue trace (seasonality times the mean of this year's close_adj) shows that the tendencies are quite similar to the actual price movement (blue). Disregarding the unknown trend, the future close_adj could look like the red line, taking ONLY seasonality into account. However, as seen in the previous charts, the residual component mostly outshines the seasonal component. Maybe there are other sectors/stocks with stronger seasonal or cyclic patterns.
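Before trusting any seasonal or ML forecast, it can help to compare it against a naive persistence baseline; a short sketch with illustrative numbers (`persistence_rmse` is my own helper, not part of the project code):

```python
import numpy as np

def persistence_rmse(prices: np.ndarray) -> float:
    """RMSE of the naive forecast "tomorrow's close equals today's".
    Any seasonal or ML model should at least beat this error."""
    pred, actual = prices[:-1], prices[1:]
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

print(persistence_rmse(np.array([50.0, 51.0, 49.0, 50.0])))  # ~1.41
```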

Add common technical analysis features / indicators

In [42]:
# add technical analysis features with ta and or pandas_ta module
# ta.add_all_ta_features(stockHist_comp, open="open", high="high", low="low", close="close", volume="volume")
# stockHist_comp.drop(columns=["trend_psar_up", "trend_psar_down"], inplace=True)  # those columns contain too many nans
In [43]:
# calculate SMA and EMA of close_adj
stockHist_comp['SMA'] = stockHist_comp.sort_values(by="date", ascending=True)["close_adj"].rolling(window=20).mean()
stockHist_comp['EMA'] = stockHist_comp.sort_values(by="date", ascending=True)["close_adj"].ewm(span=20).mean()
# difference between close_adj value and its moving average
stockHist_comp["diffCloseSMA"] = stockHist_comp.close_adj - stockHist_comp.SMA
fig = go.Figure()
fig.update_layout(title="Difference of Closing Value and SMA", yaxis_title="Price", template="plotly_white")
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.SMA, name="SMA", line_color="blue"))
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.close_adj, name="close_adj", line_color="green", line_width=3))
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.close_adj+stockHist_comp.diffCloseSMA, name="close_adj + diffCloseSMA", opacity=.3, line_color="lime", fill="tonexty"))

fig.show()
fig.write_html("data/results/reports/diffCloseSMA_plot.html")

A significant lightgreen area on top of close_adj suggests selling, while the lightgreen areas beneath close_adj suggest buying.
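The buy/sell reading of that gap can be sketched as a simple signal function. This is my own illustrative helper, not part of the notebook's pipeline: the function name and the 5% threshold are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical helper: turn the gap between close and its SMA into a crude
# mean-reversion signal. Window and threshold values are assumptions.
def sma_gap_signal(close: pd.Series, window: int = 20, threshold: float = 0.05) -> pd.Series:
    """Return +1 (buy) where close is `threshold` below its SMA,
    -1 (sell) where it is `threshold` above, else 0."""
    sma = close.rolling(window=window).mean()
    gap = (close - sma) / sma              # relative distance to the moving average
    signal = pd.Series(0, index=close.index)
    signal[gap > threshold] = -1           # far above the SMA -> sell tendency
    signal[gap < -threshold] = 1           # far below the SMA -> buy tendency
    return signal

prices = pd.Series([100, 101, 99, 120, 80, 100, 100, 100])
print(sma_gap_signal(prices, window=3, threshold=0.05).tolist())
# [0, 0, 0, -1, 1, 0, -1, 0]
```

The first `window - 1` entries stay 0 because the SMA is undefined there (NaN comparisons evaluate to False).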

In [44]:
# indicator that shows when the close_adj value crosses the Bollinger Band (even though more interesting would be the re-entry into the Bollinger Band, indicating a trend)
stockHist_comp[["BBlow", "BBmid", "BBup", "BBwidth", "BBperc"]] = pandas_ta.bbands(close=stockHist_comp.sort_values(by="date", ascending=True)["close_adj"], length=20)
In [45]:
# RSI indicator (indicates overboughtness/oversoldness)
stockHist_comp["RSI"] = pandas_ta.rsi(close=stockHist_comp.sort_values(by="date", ascending=True)["close_adj"], length=10, append=True)
In [46]:
# create subplots showing Bollinger Bands, SMA and RSI in combination
# BB
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, row_width=[0.25, 0.75])
fig.update_layout(title="Bollinger Bands", yaxis_title="Price", template="plotly_white")
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.SMA, name="SMA", line_color="blue"), row=1, col=1)
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.BBmid, name="BBmid", line_color="lightblue"), row=1, col=1)
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.BBlow, name="BBlow", line_color="black"), row=1, col=1)
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.BBup, name="BBup", line_color="black"), row=1, col=1)
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.close_adj, name="close_adj", line_color="red"), row=1, col=1)

# RSI
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.RSI, name="RSI", line_color="orange"), row=2, col=1)

# Add upper/lower bounds
fig.update_yaxes(range=[-10, 110], row=2, col=1)
fig.add_hline(y=0, col=1, row=2, line_color="#666", line_width=2)
fig.add_hline(y=100, col=1, row=2, line_color="#666", line_width=2)

# Add overbought/oversold
fig.add_hline(y=30, col=1, row=2, line_color='#336699', line_width=2, line_dash='dash')
fig.add_hline(y=70, col=1, row=2, line_color='#336699', line_width=2, line_dash='dash')

fig.show()
fig.write_html("data/results/reports/bb_rsi_plot.html")

Crossings of the Bollinger Bands (BBperc would be negative or above 1) could indicate an imminent reversion toward the mean (SMA). A sell or buy recommendation would then be triggered by the re-entry into the Bollinger Band. The RSI is mostly used in combination with upper and lower limits such as 70%/30%, indicating that the stock is overbought/oversold.
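The re-entry idea described above could be sketched as follows. The column semantics mirror the BBlow/BBup columns computed with pandas_ta, but the helper itself is a hypothetical illustration, not tested trading logic.

```python
import pandas as pd

# Hedged sketch: a buy signal when the price crosses back above the lower
# band after having been below it, a sell signal on re-entry from above
# the upper band. bb_low/bb_up correspond to the BBlow/BBup columns above.
def bb_reentry_signal(close: pd.Series, bb_low: pd.Series, bb_up: pd.Series) -> pd.Series:
    below = close < bb_low
    above = close > bb_up
    signal = pd.Series(0, index=close.index)
    signal[below.shift(1, fill_value=False) & ~below] = 1   # re-entry from below -> buy
    signal[above.shift(1, fill_value=False) & ~above] = -1  # re-entry from above -> sell
    return signal

close = pd.Series([9.0, 11.0, 15.0, 13.0, 10.0])
bb_low = pd.Series([10.0] * 5)
bb_up = pd.Series([14.0] * 5)
print(bb_reentry_signal(close, bb_low, bb_up).tolist())  # [0, 1, 0, -1, 0]
```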

Create own features / indicators (experimental!)

In [47]:
# mark local extrema (to maybe predict if the next day is going to be a local minimum or maximum)
def mark_localExtrema(df, col="close", n=5, plot=False):
    """
    input:
        df (DataFrame): dataframe that includes the timeseries whos extrema shall be found
        col (str): name of the dataframe column that contains the timeseries
        n (int): number of timesteps to be checked before and after
    output:
        df (DataFrame): returns the dataframe with two columns (binary values) "minimum" and "maximum"
        plot (matplotlib plot): if plot==True, a plot will be displayed with the extrema marked
    """
    minName = "{}dayMinimum".format(str(n))
    maxName = "{}dayMaximum".format(str(n))
    
    df[minName] = df.iloc[argrelextrema(df[col].values, np.less_equal,
                        order=n)[0]][col]
    df[maxName] = df.iloc[argrelextrema(df[col].values, np.greater_equal,
                        order=n)[0]][col]
    
    if plot==True:
        fig, ax = plt.subplots()
        ax.plot(df[col])
        ax.plot(df[minName], marker="o", color="green")
        ax.plot(df[maxName], marker="o", color="red")
        plt.title("{}-Day-Extrema".format(str(n)))
    
    # 1 if day is an extremum, 0 if not
    df[minName] = (df[minName]/df[minName]).fillna(0)
    df[maxName] = (df[maxName]/df[maxName]).fillna(0)
    df["{}dayExtremum".format(str(n))] = df[minName]-df[maxName]
    
    #df.rename(columns={"minimum": "{}dayMinimum".format(str(n)), 
    #                   "maximum": "{}dayMaximum".format(str(n)), 
    #                   "extremum": "{}dayExtremum".format(str(n))}, inplace=True)
    
    return df
In [48]:
for days in [5, 20, 60]:
    stockHist_comp = mark_localExtrema(stockHist_comp, "close_adj", days, plot=True)
In [49]:
# create an indicator that is 1 if, on each of the next n day(s) after day x, the target value is higher than or equal to its value on day x, and -1 if it is lower
# (similar to extrema but not the same: extrema consider higher/lower values on both sides, while this weak trend indicator only looks at the "right"/future side)
def create_nDayTrendWeakIndicator(df, target, n, plot=False):
    """
    input:
        df (DataFrame): data frame containing the target value
        target (str): name of target value
        n (int): number of future days over which the target value has to be higher or lower
        plot (bool): output will be a plot if True
    output:
        df (DataFrame): data frame containing the input data frame supplemented with the new indicator(s)
    """
    df["temp0"] = 1
    for days in range(1,n+1):
        df["temp"] = (df[target] - df[target].shift(days)) / -abs(df[target] - df[target].shift(days))
        df["temp0"] = df["temp0"] + df["temp"]
    
    df["{}dayTrendWeak".format(str(days))] = df["temp0"].apply(lambda x : 1 if (x == n+1) else (-1 if (x == -n+1) else 0))
    
    df.drop(columns=["temp0", "temp"], inplace=True)
    
    if plot:
        plot_data = [
            go.Scatter(
                x=df.index,
                y=df[target],
                marker_color="blue",
                name='target values'
            ),
            go.Scatter(
                x=df.index,
                y=df["{}dayTrendWeak".format(str(days))].replace(0, np.nan).replace(-1, np.nan)*df[target],
                marker_color="green",
                mode="markers",
                opacity=.6,
                marker_size=10,
                name='weak positive trend'
            ),
            go.Scatter(
                x=df.index,
                y=-df["{}dayTrendWeak".format(str(days))].replace(0, np.nan).replace(1, np.nan)*df[target],
                marker_color="red",
                mode="markers",
                opacity=.6,
                marker_size=10,
                name='weak negative trend'
            )
        ]

        plot_layout = go.Layout(title='Trend Indicator ({}-day-weak)'.format(n), template="plotly_white")
        fig = go.Figure(data=plot_data, layout=plot_layout)
        fig.show()
        fig.write_html("data/results/reports/trend_weak_plot.html")
    
    return df
In [50]:
stockHist_comp = create_nDayTrendWeakIndicator(stockHist_comp, "close_adj", n=10, plot=True)
In [51]:
# create an indicator that is 1 if the target value increases EVERY day without pullback during the next n day(s) after day x, -1 if it decreases EVERY day, and 0 otherwise
def create_nDayTrendStrongIndicator(df, target, n, plot=False):
    """
    input:
        df (DataFrame): data frame containing the target value
        target (str): name of target value
        n (int): number of days that the target value has to be continuously higher or lower than on each previous day
        plot (bool): output will be a plot if True
    output:
        df (DataFrame): data frame containing the input data frame supplemented with the new indicator(s)
    """
    df["temp0"] = 1
    for days in range(1,n+1):
        df["temp"] = (df[target].shift(days-1) - df[target].shift(days)) / -abs(df[target].shift(days-1) - df[target].shift(days))
        df["temp0"] = df["temp0"] + df["temp"]
    
    df["{}dayTrendStrong".format(str(days))] = df["temp0"].apply(lambda x : 1 if (x == n+1) else (-1 if (x == -n+1) else 0))
    
    df.drop(columns=["temp0", "temp"], inplace=True)
    
    if plot:
        plot_data = [
            go.Scatter(
                x=df.index,
                y=df[target],
                marker_color="blue",
                name='target values'
            ),
            go.Scatter(
                x=df.index,
                y=df["{}dayTrendStrong".format(str(days))].replace(0, np.nan).replace(-1, np.nan)*df[target],
                marker_color="green",
                mode="markers",
                opacity=.6,
                marker_size=10,
                name='strong positive trend'
            ),
            go.Scatter(
                x=df.index,
                y=-df["{}dayTrendStrong".format(str(days))].replace(0, np.nan).replace(1, np.nan)*df[target],
                marker_color="red",
                mode="markers",
                opacity=.6,
                marker_size=10,
                name='strong negative trend'
            )
        ]

        plot_layout = go.Layout(title='Trend Indicator ({}-day-strong)'.format(n), template="plotly_white")
        fig = go.Figure(data=plot_data, layout=plot_layout)
        fig.show()
        fig.write_html("data/results/reports/trend_strong_plot.html")
    
    return df
In [52]:
stockHist_comp = create_nDayTrendStrongIndicator(stockHist_comp, "close_adj", n=5, plot=True)
Add future close_adj prices as well as return for each timestep
In [53]:
# calculate the return of close_adj after n future timesteps, both absolute and in %
def calc_futureReturn(df, n):
    """
    input:
        df (DataFrame): data frame containing the close_adj values for the return calculation
        n (int): number of future days used for the return calculation
    output:
        df (DataFrame): input dataframe supplemented with the new indicators/features
    """
    df["{}dayReturn".format(str(n))] = df.close_adj.shift(n) - df.close_adj
    df["{}dayReturn_perc".format(str(n))] = 1 - (df.close_adj.shift(n) / df.close_adj)
    
    return df
In [54]:
for days in [1, 2, 3, 4, 5]:
    stockHist_comp = calc_futureReturn(stockHist_comp, days)
In [75]:
fig = go.Figure()
fig.update_layout(title="Return_perc after the following day", yaxis_title="Return [%]", template="plotly_white", showlegend=True)
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp["1dayReturn_perc"], name="1dayReturn_perc"))
fig.add_hline(y=0)

fig.show()
fig.write_html("data/results/reports/1dayReturn_perc_plot.html")
In [56]:
# add future n close_adj values to each timestep
def add_futureValues(df, n):
    """
    input:
        df (DataFrame): data frame containing the close_adj values
        n (int): function will add a column where each row shows the close_adj value of the n'th day after the actual close_adj value
    output:
        df (DataFrame): input dataframe supplemented with the new indicators/features
    """
    df["close_adj_in{}days".format(str(n))] = df.close_adj.shift(n)
    
    return df
In [57]:
for days in [1, 2, 3, 4, 5]:
    stockHist_comp = add_futureValues(stockHist_comp, days)

Final feature: Buy, sell or keep recommendation (ONLY THEORETICAL THOUGHTS - NOT IMPLEMENTED YET)

  • some technical indicators can be calculated for the current or last day, some can only be calculated several days in the past
  • criteria might be subjective
  • Sell Recommendation increasing with:
    • fundamental analysis indicates a loss of stock value
    • negative momentum after positive momentum? (negative change of momentum derivative?!)
    • predicted Maximum
    • predicted negative TrendWeak
    • predicted negative TrendStrong
    • predicted close value higher than simple moving average (SMA)
    • predicted close value falls under upper Bollinger Band (BB) (after having crossed it)
  • Buy Recommendation increasing with:
    • fundamental analysis indicates a gain of stock value
    • positive momentum after negative momentum? (positive change of momentum derivative?!)
    • predicted Minimum
    • predicted positive TrendWeak
    • predicted positive TrendStrong
    • predicted close value lower than simple moving average (SMA)
    • predicted close value rises above lower Bollinger Band (BB) (after having crossed it)
  • normalize criteria (e.g. return values)
  • add weight to each criterion
  • visualize Recommendation with green/lightblue/red or spectrum color markers in different sizes (big red = strong sell rec, small lightgreen = slight buy/keep tendency)
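The "normalize criteria, then weight them" steps at the end of the list might look roughly like the following sketch. Criterion names and weights are placeholders, not the final feature set.

```python
import pandas as pd

# Hedged sketch: scale each criterion to [-1, 1] by its own maximum absolute
# value and combine it with a subjective weight into a single score.
def weighted_recommendation(criteria: pd.DataFrame, weights: dict) -> pd.Series:
    score = pd.Series(0.0, index=criteria.index)
    for name, weight in weights.items():
        col = criteria[name]
        max_abs = col.abs().max()
        normalized = col / max_abs if max_abs else col  # scale to [-1, 1]
        score += weight * normalized
    return score / sum(weights.values())                # keep the result in [-1, 1]

criteria = pd.DataFrame({
    "extremum": [1, 0, -1],
    "trend_weak": [2, 0, -2],   # wider raw range, gets rescaled
})
weights = {"extremum": 0.7, "trend_weak": 0.3}
print(weighted_recommendation(criteria, weights).tolist())  # [1.0, 0.0, -1.0]
```

Positive scores would then map to buy recommendations, negative ones to sell, with the absolute value as the marker size in the plot below.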
In [58]:
# try to transform indicators to a form, where negative/positive values equal a sell/buy recommendation (the higher the absolute value, the stronger)
def create_buySellKeepRec(df):
    df["buySellKeepRec"] = 0
    df["buySellKeepRec"] = df["60dayExtremum"] + df["10dayTrendWeak"] + df["5dayTrendStrong"]
    #df.loc[~df["buySellKeepRec"].isin([3,-3])] = 0
    
    return df

stockHist_comp = create_buySellKeepRec(stockHist_comp)

def SetColor(x):
    if(x < -1):
        return "red"
    elif(-1<= x <=1):
        return "white"
    elif(x > 1):
        return "green"

fig = go.Figure()
fig.add_trace(go.Scatter(x=stockHist_comp.index, y=stockHist_comp.close_adj, mode="markers+lines", marker=dict(size=(stockHist_comp.buySellKeepRec.abs()+.2)*5, opacity=.6, color=(list(map(SetColor, stockHist_comp.buySellKeepRec))))))
fig.update_layout(title="Buy/Sell/Keep-Recommendation", template="plotly_white")

fig.show()
fig.write_html("data/results/reports/buySellKeepRec_plot.html")

The Buy/Sell/Keep-Recommendation Feature is still under construction and by far not ready yet to be used.

Machine Learning for Trend/Price/Return Prediction and Buy/Keep/Sell Recommendation

My expectation when I started the project was that there must be two or three commonly used techniques, but the deeper I dove into the field of Timeseries Forecasting and Machine Learning for Trading, the more I understood that I had opened Pandora's box. To clear my mind, I will try to cluster the main strategies and algorithms (by far not all) that I found to be used for the given task(s).

  • Machine Learning Models/Algorithms used in the field of Trading and Timeseries Forecasting
    • Volatility Forecasts and Statistical Arbitrage
      • AR (Autoregressive Models)
      • MA (Moving Average Models)
      • ARMA/ARIMA
      • ETS (Exponential Smoothing)
    • LR (Linear Regression)
    • KNN (K-Nearest Neighbours)
    • Bayesian Models
    • SVM (Support Vector Machine)
      • SVC (Support Vector Machine Classifier)
      • SVR (Support Vector Machine Regressor)
    • Decision Trees
    • Random Forests
    • Ensemble Methods (e.g. Bagging)
    • DNN (Deep Neural Networks)
      • CNN (Convolutional Neural Network)
      • RNN (Recurrent Neural Networks)
        • LSTM (Long Short-Term Memory)
      • Q-Learning
      • TFT (Temporal Fusion Transformer)
    • Sentiment Analysis
      • NLP (Natural Language Processing)
  • Methods to improve the models
    • Cross-Validation
    • Boosting (Adaboost, XGBoost)
    • Hyper Parameter Optimization (GridSearch)
    • Feature Engineering
    • Interpreting Measures of Error
      • ME (Mean Error)
      • MPE (Mean Percentage Error)
      • RMSE (Root Mean Squared Error)
      • MAE (Mean Absolute Error)
      • MAPE (Mean Absolute Percentage Error)
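For reference, the listed error measures can be implemented in a few lines (assuming y_true contains no zeros for the percentage variants):

```python
import numpy as np

# Reference implementations of the error measures listed above.
def error_measures(y_true, y_pred) -> dict:
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    return {
        "ME": err.mean(),                                   # mean error (signed)
        "MPE": (err / y_true).mean() * 100,                 # mean percentage error
        "RMSE": np.sqrt((err ** 2).mean()),                 # root mean squared error
        "MAE": np.abs(err).mean(),                          # mean absolute error
        "MAPE": (np.abs(err) / np.abs(y_true)).mean() * 100,  # mean absolute percentage error
    }

m = error_measures([100, 200], [110, 190])
print(m)  # ME 0.0, MPE 2.5, RMSE 10.0, MAE 10.0, MAPE 7.5
```

Note that ME and MPE can cancel out over- and underestimation (0.0 and 2.5 here despite two wrong predictions), which is why RMSE/MAE/MAPE are usually more informative.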

After the research I decided to start with an application of an LSTM recurrent neural network, which seems to be a popular approach for timeseries forecasting.

Thoughts on Possible Strategies

  • Train Test Split Options
    1. Slice training data into windows/sequences where each sequence has a past-sequence to learn on, and a future-sequence to predict
      • Length of past-sequence and future-sequence should probably fit to the desired final prediction
    2. Use the whole training data as one big window to learn (no direct validation possible) - not sure if recommendable though.
  • Number of Features
    1. Only use one feature to train and to be predicted (e.g. adjusted close value)
    2. Use several features as input to predict one output (e.g. OHLC data including indicators as input to predict only closing value)
      • Take into account that not all technical indicators can be calculated for the current/last day -> Gap between training data and prediction date
    3. Use multi in- and output features to predict e.g. not only close values but trend indicators as well (or use it to predict n days at once)
  • Choice of Timespan
    1. How many steps of past data do I need to feed my model so that it can properly predict the next n values/timesteps?
      • Presumably depends on the stock itself and its corresponding benchmark/sector
      • Take into account that the Alpha Vantage API isn't always up to date (e.g. on Monday evening there may still only be data from Friday evening)
    2. How can I predict more than just one step into the future?
      • Predict one step, then add this value to the given data and predict the next step (only possible for univariate input)
      • Add the future n values to each timestep for training, so that the model will be trained on predicting the next n steps at once?!
      • DNNs like TFT seem to be able to predict several steps ahead even with multiple output/target
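The first multi-step option above (predict one step, feed it back in, predict again) can be sketched model-agnostically. NaiveModel is a dummy stand-in for a trained one-step model, not anything from the notebook.

```python
import numpy as np

# Hedged sketch of the recursive strategy: predict one step, append the
# prediction to the input window, slide forward, and repeat. `model` is any
# object with a predict(window) -> float method (univariate input only).
def recursive_forecast(model, history: np.ndarray, steps: int, win_length: int) -> list:
    window = list(history[-win_length:])
    forecasts = []
    for _ in range(steps):
        next_value = model.predict(np.array(window))
        forecasts.append(next_value)
        window = window[1:] + [next_value]   # slide the window forward by one step
    return forecasts

class NaiveModel:
    """Placeholder model: simply repeats the last observed value."""
    def predict(self, window):
        return float(window[-1])

print(recursive_forecast(NaiveModel(), np.array([1.0, 2.0, 3.0]), steps=2, win_length=2))
# [3.0, 3.0]
```

The drawback, as noted above, is that prediction errors compound with every step, and the approach only works when input and output are the same single feature.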

Training a model with multivariate input

Understanding the structure of TimeseriesGenerator-Elements

The Tensorflow Keras module offers a utility called TimeseriesGenerator that divides past timeseries data into chunks/windows with separate input arrays and target arrays. In order to understand its parameters I tried to visualize it in the following diagram, hoping it would serve my purpose of prediction. Every color in the plot stands for a single window of the whole timeseries that will be used during training to predict one step ahead (big black dot). The window length, as well as the spacing between window starting points (stride), can be parameterized.

In [59]:
# divide the whole historic data into batches without scaling for visualization purposes
batch_size=1  # number of windows per generated batch
win_length=500  # timesteps of stock data per window
stride=win_length+1  # timesteps between starting points of windows (hopping windows)
df = stockHist_comp.sort_values(by="date", ascending=True)[["close_adj"]].dropna()
df.reset_index(inplace=True)
tsg = TimeseriesGenerator(df[["date", "close_adj"]].to_numpy(), df[["date", "close_adj"]].to_numpy(), length=win_length,
                          stride=stride, sampling_rate=1, batch_size=batch_size)
print("number of timesteps before timeseriesgeneration: ", len(df))

def timeseriesgenerator_decomposition(tsg):
    print("windows in tsg: ", len(tsg))
    print("1 input + 1 output array: ", len(tsg[0]))
    print("window, training timesteps per window, input-features: ", tsg[0][0].shape)
    print("window, output-features (for only one timestep): ", tsg[0][1].shape)
    print("input window shape: ", tsg[0][0][0].shape)
    print("output window shape: ", tsg[0][1][0].shape)
    print("last timestep of the first input window: ", tsg[0][0][0][-1])
    print("output value(s) for the first window: ", tsg[0][1][0])
    
timeseriesgenerator_decomposition(tsg)

# plot windows
fig = go.Figure()
fig.update_layout(title="Timeseries Chunks generated by the TimeSeriesGenerator function", 
                  yaxis_title="Price", template="plotly_white")
fig.add_trace(go.Scatter(x=df.date, y=df.close_adj, name="close_adj"))
for e in range(len(tsg)):
    win_inp = pd.DataFrame(tsg[e][0][0])
    win_inp.columns = ["date", "close_adj"]
    win_inp.index = win_inp.date
    win_inp.drop(columns=["date"], inplace=True)
    fig.add_trace(go.Scatter(x=win_inp.index, y=win_inp.close_adj, name="element "+str(e)))
    fig.add_trace(go.Scatter(mode="markers", x=[tsg[e][1][0][0]], y=[tsg[e][1][0][1]], 
                             marker=dict(symbol="circle", size=15, color="black"), showlegend=False))
fig.show()
fig.write_html("data/results/reports/timeseriesgenerator_plot.html")

# usually "stride" doesn't need to be set to window length, but it helps for the visualization
number of timesteps before timeseriesgeneration:  4306
windows in tsg:  8
1 input + 1 output array:  2
window, training timesteps per window, input-features:  (1, 500, 2)
window, output-features (for only one timestep):  (1, 2)
input window shape:  (500, 2)
output window shape:  (2,)
last timestep of the first input window:  [Timestamp('2006-12-11 00:00:00') 25.2109]
output value(s) for the first window:  [Timestamp('2006-12-12 00:00:00') 25.1991]

Unfortunately, after having spent a lot of time understanding the function and structure of keras' TimeseriesGenerator, I found out that it supposedly only works for single-step predictions and not for multi-step & multi-output predictions (like several feature values for several days in the future). In those cases I would have to write my own functions to create timeseries windows. However, for the first edition of this project I will stick with a one-timestep prediction of the close_adj value and use the TimeseriesGenerator as described.
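Such a hand-rolled windowing function could look like the following sketch. make_windows is a hypothetical name; a real version would additionally select which columns serve as inputs versus targets.

```python
import numpy as np

# Hedged sketch: slice a (timesteps, features) array into input windows of
# length `past` and target windows of length `future`, supporting the
# multi-step and multi-output targets that TimeseriesGenerator lacks.
def make_windows(data: np.ndarray, past: int, future: int):
    X, Y = [], []
    for start in range(len(data) - past - future + 1):
        X.append(data[start:start + past])                    # input sequence
        Y.append(data[start + past:start + past + future])    # target sequence
    return np.array(X), np.array(Y)

data = np.arange(20, dtype=float).reshape(10, 2)  # 10 timesteps, 2 features
X, Y = make_windows(data, past=4, future=2)
print(X.shape, Y.shape)  # (5, 4, 2) (5, 2, 2)
```

Each target window starts directly after its input window, so a model trained on (X, Y) learns to predict `future` steps for all features at once.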

Predicting one value one step in the future with multivariate input

The following code will split the historic stock data including some indicators in a train and test data set. For both data sets the TimeseriesGenerator creates a bunch of windows for model training and validation. I will use 8 features as input for a timespan of 20 workdays (approx. a month) to predict the close_adj value of the next day.

In [60]:
# df needs to be sorted from old to new data! Target value has to be first feature in df!
df = stockHist_comp.sort_values(by="date", ascending=True)[["close_adj", "SMA", "EMA", "RSI", "volume", "diffCloseSMA", "BBperc", "1dayReturn_perc"]]

test_size=.2  # share of data used for testing -> a small share should be enough, so the model is trained on data close to the current stock values
batch_size=32  # number of windows per training batch
win_length=20  # timesteps of stock data per window
epochs=20  # number of training iterations to improve loss
patience=8  # number of epochs without "improving loss" leading to stop the training

# scale according to input value range -> if there are negative values, the data should be normalized between -1 and 1, else between 0 and 1
# normalization should be executed for each timewindow, else the model is trained with lower values for stocks that continuously gain value (e.g. due to inflation etc.)
scaler = MinMaxScaler()  # maybe use standardscaler for prices, since the future min and max values of price are unknown for now?
data_scaled = scaler.fit_transform(df.dropna())
input_data = data_scaled[:, :]
target = data_scaled[:, 0]

X_train, X_test, Y_train, Y_test = train_test_split(input_data, target, test_size=test_size, shuffle=False)

train_generator = TimeseriesGenerator(X_train, Y_train, length = win_length, sampling_rate = 1, batch_size = batch_size)  # batch_size = number of windows per training batch
test_generator = TimeseriesGenerator(X_test, Y_test, length = win_length, sampling_rate = 1, batch_size = 1)

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(8, input_shape = (win_length, input_data.shape[1]), return_sequences=True))  # return_sequences=True -> the layer emits an output for every input timestep
model.add(tf.keras.layers.Dense(1))
# model.add(tf.keras.layers.LSTM(16, activation='relu', return_sequences=True))
# model.add(tf.keras.layers.Dense(72))

# model.summary()

early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=patience,
                                                  mode="min")  # if the val_loss doesn't change for n=patience iterations, stop.

# what does keras.callbacks.ModelCheckpoint do?

model.compile(loss = tf.losses.MeanSquaredError(),
             optimizer = tf.optimizers.Adam(),
             metrics = [tf.metrics.MeanAbsoluteError()])

history = model.fit(train_generator, epochs = epochs,
                             validation_data = test_generator,
                             shuffle = False,
                             callbacks = [early_stopping])

def visualize_loss(history):
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(len(loss))
    plt.figure()
    plt.plot(epochs, loss, "b", label="Training loss")
    plt.plot(epochs, val_loss, "r", label="Validation loss")
    plt.title("Training and Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()

visualize_loss(history)

model.evaluate(test_generator, verbose=2)
Epoch 1/20
107/107 [==============================] - 4s 21ms/step - loss: 0.0150 - mean_absolute_error: 0.0836 - val_loss: 0.0261 - val_mean_absolute_error: 0.1267
Epoch 2/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0119 - mean_absolute_error: 0.0779 - val_loss: 0.0169 - val_mean_absolute_error: 0.0919
Epoch 3/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0070 - mean_absolute_error: 0.0584 - val_loss: 0.0135 - val_mean_absolute_error: 0.0827
Epoch 4/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0056 - mean_absolute_error: 0.0522 - val_loss: 0.0112 - val_mean_absolute_error: 0.0755
Epoch 5/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0046 - mean_absolute_error: 0.0470 - val_loss: 0.0096 - val_mean_absolute_error: 0.0697
Epoch 6/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0040 - mean_absolute_error: 0.0435 - val_loss: 0.0083 - val_mean_absolute_error: 0.0651
Epoch 7/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0035 - mean_absolute_error: 0.0411 - val_loss: 0.0074 - val_mean_absolute_error: 0.0616
Epoch 8/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0032 - mean_absolute_error: 0.0393 - val_loss: 0.0067 - val_mean_absolute_error: 0.0589
Epoch 9/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0029 - mean_absolute_error: 0.0380 - val_loss: 0.0061 - val_mean_absolute_error: 0.0568
Epoch 10/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0027 - mean_absolute_error: 0.0368 - val_loss: 0.0057 - val_mean_absolute_error: 0.0551
Epoch 11/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0026 - mean_absolute_error: 0.0358 - val_loss: 0.0053 - val_mean_absolute_error: 0.0537
Epoch 12/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0024 - mean_absolute_error: 0.0349 - val_loss: 0.0050 - val_mean_absolute_error: 0.0525
Epoch 13/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0023 - mean_absolute_error: 0.0341 - val_loss: 0.0048 - val_mean_absolute_error: 0.0515
Epoch 14/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0022 - mean_absolute_error: 0.0333 - val_loss: 0.0046 - val_mean_absolute_error: 0.0506
Epoch 15/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0021 - mean_absolute_error: 0.0327 - val_loss: 0.0044 - val_mean_absolute_error: 0.0498
Epoch 16/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0020 - mean_absolute_error: 0.0320 - val_loss: 0.0042 - val_mean_absolute_error: 0.0491
Epoch 17/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0019 - mean_absolute_error: 0.0315 - val_loss: 0.0041 - val_mean_absolute_error: 0.0485
Epoch 18/20
107/107 [==============================] - 2s 14ms/step - loss: 0.0018 - mean_absolute_error: 0.0309 - val_loss: 0.0040 - val_mean_absolute_error: 0.0480
Epoch 19/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0018 - mean_absolute_error: 0.0304 - val_loss: 0.0039 - val_mean_absolute_error: 0.0474
Epoch 20/20
107/107 [==============================] - 2s 15ms/step - loss: 0.0017 - mean_absolute_error: 0.0300 - val_loss: 0.0038 - val_mean_absolute_error: 0.0470
838/838 - 1s - loss: 0.0038 - mean_absolute_error: 0.0470 - 964ms/epoch - 1ms/step
Out[60]:
[0.0037895527202636003, 0.04696320742368698]

I wrote a function to save the "latest" model. It will overwrite the last model but also create a copy in a backup folder.

In [61]:
# function to save keras models including backups
def save_model(model, name):
    timestamp = str(datetime.datetime.now().strftime("%Y%m%d_%H%M"))
    model.save("data/models/{}/current/".format(name))  # overwrite recent model
    model.save("data/models/{}/backup/{}/".format(name, timestamp))  # create backup with timestamp
In [62]:
# save model
# save_model(model, "model")

In case the training doesn't work as planned, or to save time, the latest model can be loaded with the following command.

In [63]:
# load model
# model = tf.keras.models.load_model("data/models/model/current/")
Backtesting: Pick a Random Chunk of Historic Data (from the test data set, to avoid bias) in the corresponding win_length format and compare the prediction to the true value

In order to visualize what the model predicts, I wrote some code that applies the model to a random timeseries window of the test data set and calculates an error from the difference between the predicted and the actual future close_adj price. This code snippet can be repeated manually for a better understanding. In a future edition of this project, I will use the code in a loop to calculate and compare the errors of different model parametrizations.
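The planned loop over all test windows might be sketched like this. backtest_all_windows is a hypothetical helper; predict_fn stands in for the trained model's single-window prediction, and the toy data replaces the real generator.

```python
import numpy as np

# Hedged sketch: evaluate every (input_window, true_value) pair of a
# generator-like sequence instead of one random window, and aggregate
# the absolute errors into a single comparable score.
def backtest_all_windows(windows, predict_fn) -> float:
    errors = []
    for x, y_true in windows:
        y_pred = predict_fn(x)
        errors.append(abs(y_pred - y_true))
    return float(np.mean(errors))

# toy example: the "model" predicts the last value of each window
windows = [(np.array([1.0, 2.0]), 3.0), (np.array([2.0, 3.0]), 3.0)]
print(backtest_all_windows(windows, lambda x: float(x[-1])))  # 0.5
```

Running this for several model parametrizations would give one mean absolute error per configuration, making them directly comparable.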

In [64]:
timeseriesgenerator_decomposition(test_generator)
windows in tsg:  838
1 input + 1 output array:  2
window, training timesteps per window, input-features:  (1, 20, 8)
window, output-features (for only one timestep):  (1,)
input window shape:  (20, 8)
output window shape:  ()
last timestep of the first input window:  [0.72210608 0.74605365 0.75421708 0.51051728 0.11836718 0.58899619
 0.41477414 0.58196648]
output value(s) for the first window:  0.7095864274231067
In [65]:
test_predictions = model.predict(test_generator)
test_predictions.shape
Out[65]:
(838, 20, 1)

I don't understand why the prediction on the test_generator delivers 20 values (one per input timestep) per timeseries window! (see chapter "Questions to the Reviewer")

In [85]:
# evaluate prediction of a random timeserieswindow (scaler must match format! - not implemented yet)
window_nr = random.randrange(len(test_generator))
hist = test_generator[window_nr][0][0]
fut = test_generator[window_nr][1][0]
pred = test_predictions[window_nr][-1]
empty = np.empty((1, input_data.shape[1]-1))
empty[:] = np.nan
fut_resh = np.append(fut, empty).reshape(1, -1)
pred_resh = np.append(pred, empty).reshape(1, -1)
window = np.append(hist, fut_resh, axis=0)
window = np.append(window, pred_resh, axis=0)  # append the prediction after the true future value
window = pd.DataFrame(scaler.inverse_transform(window))

# calculate error for close_adj as target
thresh = 0.05
plaus_min = window.iloc[-3, 0] + stockHist_comp["1dayReturn"].min()*(1+thresh)
plaus_max = window.iloc[-3, 0] + stockHist_comp["1dayReturn"].max()*(1+thresh)
# normalize fut and pred
fut_norm = (window.iloc[-2, 0] - plaus_min) / (plaus_max - plaus_min)
pred_norm = (window.iloc[-1, 0] - plaus_min) / (plaus_max - plaus_min)
error = abs((pred_norm-fut_norm)*100)

"""
# calculate error for 1dayReturn_perc as target
plaus_min = stockHist_comp["1dayReturn_perc"].min()*(1+thresh)
plaus_max = stockHist_comp["1dayReturn_perc"].max()*(1+thresh)
# normalize fut and pred
fut_norm = (window.iloc[-2, 0] - plaus_min) / (plaus_max - plaus_min)
pred_norm = (window.iloc[-1, 0] - plaus_min) / (plaus_max - plaus_min)
error = (pred_norm-fut_norm)*100
# punish prediction if it has the opposite sign of the true value (factor 3)
if window.iloc[-2, 0]*window.iloc[-1, 0] < 0:
    error = error*3
"""

fig = go.Figure()
fig.update_layout(title="Random test data prediction compared to actual value", 
                  yaxis_title="Price", xaxis_title="Timesteps", template="plotly_white")

# history values
fig.add_trace(go.Scatter(x=window.index[:-2], y=window.iloc[:-2, 0], name="History Data"))
# true value
fig.add_trace(go.Scatter(x=[window.index[-2]], y=[window.iloc[-2, 0]], name="Real Future Value", mode="markers", 
                         marker=dict(symbol="circle-open-dot", color="green", size=15, opacity=.6, line=dict(width=2))))
# predicted value
fig.add_trace(go.Scatter(x=[window.index[-2]], y=[window.iloc[-1, 0]], name="Predicted Future Value", mode="markers+text", 
                         marker=dict(symbol="y-up", color="red", size=15, line=dict(color="red", width=3)), text=["Error: {:.0f}%".format(error)], textposition="top left"))
# plaus_min & plaus_max
fig.add_hrect(y0=plaus_min, y1=plaus_max, line_width=1, fillcolor="lightblue", opacity=0.2, name="plausible value range")
fig.add_hline(y=plaus_min, line=dict(color="orange"), annotation_text="1dayReturn Minimum - 5%", annotation_position="bottom", name="plaus_min")
fig.add_hline(y=plaus_max, line=dict(color="orange"), annotation_text="1dayReturn Maximum + 5%", annotation_position="top", name="plaus_max")

fig.show()
fig.write_html("data/results/reports/backtest_plot.html")
Thoughts/Findings on the Machine Learning Results
  • As we can see, the predictions are pretty bad (no need to calculate measures of error in detail). Methods to improve the model are mentioned in the next step.
  • Especially for other stocks that tend to increase much more strongly over time, the prediction delivers values that are mostly way too low.
    • My guess is that the "wrong" normalization is one of the major reasons: each timeseries window should be normalized on its own scale! Otherwise the trained values tend to be a lot lower than the future values in the test/validation data, and in addition the calculated error and loss will be biased.
    • Shortening the window_length compensates this error a little but does not improve the model! It rather tends to overfit, so that the validation loss rises with more epochs.
  • Using only univariate input (close_adj values) seems to be the best option in this case. Probably the other features confused the model rather than helping its prediction.
  • Another point is the error calculation. I tried to normalize the predicted and true value before calculating the error, with a minimum and maximum that are not 0 and infinity (since the predicted value could in theory take any value). So the question is which min and max values would be plausible for the prediction.
    • Min and max target values during the past win_length period wouldn't be perfect either, since the true value could still be higher or lower. But not defining the limits at all would lead to better/lower errors whenever the prediction is just anywhere close to the last known value.
    • A better choice might be the last known value in the window +- the absolute maximum/minimum of the 1dayReturn data plus a small threshold, since it is very unlikely (though not impossible) that the next day's return will be larger in absolute terms than any in the past years.
    • Attention: this might cause problems for stocks with high volatility and low prices, since the lower limit could be a negative target value.
  • When I use the 1dayReturn feature as the target to be predicted (by simply putting it in the first column of the dataframe), the predictions seem to be more accurate. Since my main goal is to predict an up or down trend (as BuySellKeepRecommendation), I will punish predictions in the wrong direction more strongly by multiplying the error with a factor whenever the sign of the 1dayReturn prediction differs from the true sign.
    • Unfortunately the plaus_min/max limits would have to be corrected for 1dayReturn, since the max or min return should not be added to the last value. (Due to lack of time and for readability, I will not implement a separate algorithm in this notebook.)
    • I also noted that in the 1dayReturn_perc predictions the predicted value barely changes and is never positive. Due to the given error calculation it might still look better than it should (bias!).
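The sign-punishing idea above could be expressed as a small helper; this is only a sketch, with the factor of 3 mirroring the commented-out block earlier and the function name being my own:

```python
import numpy as np

def directional_error(y_true, y_pred, sign_penalty=3.0):
    """Absolute error in percent, multiplied by sign_penalty whenever
    the predicted return points in the wrong direction."""
    error = np.abs(y_pred - y_true) * 100
    wrong_sign = (y_true * y_pred) < 0
    return np.where(wrong_sign, error * sign_penalty, error)

# first prediction has the right sign, second one the wrong sign
print(directional_error(np.array([0.01, 0.02]), np.array([0.015, -0.01])))
```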

Improving the Model (not part of this version due to lack of time)

Using GridSearch or XGBoost Parameter Optimization
  • when a feasible model is found, the choice of parameters should be improved with methods like grid search (iterating the training/testing process to find the parameters with the lowest error rates)
  • parameters of interest could be: features, number of features, window length, batch_size, epochs, compiling metrics?!
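A minimal grid-search skeleton for the parameters listed above; `train_and_score` is a hypothetical placeholder for the actual build/train/evaluate cycle (here it returns a dummy score so the loop is runnable as-is):

```python
import itertools

# hypothetical search space over the parameters of interest
param_grid = {
    "window_length": [10, 20, 60],
    "batch_size": [8, 32, 128],
    "epochs": [20, 50],
}

def train_and_score(window_length, batch_size, epochs):
    # placeholder for: build generators/model with these parameters,
    # train, and return the validation loss
    return window_length / batch_size + 1 / epochs

best_params, best_score = None, float("inf")
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = train_and_score(**params)
    if score < best_score:
        best_params, best_score = params, score

print(best_params)
```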
Data Preparation
  • adding more features (including weights for each feature) might help the model if the features are well chosen
  • use an adequate scaler for each feature (MinMaxScaler from -1 to 1 for the ones with negative values, 0 to 1 for the others, or StandardScaler)
  • normalize the data for each timeseries window (not sure how this works with TimeseriesGenerator - I plan to write a windowing function myself anyway)
  • take bias and overfitting into account (for example when comparing different methods/target values while they work with different error metrics)
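The per-window normalization could look like the following hand-rolled windowing function (a sketch replacing TimeseriesGenerator; names are my own). Note that the target is scaled with its window's own min/max, so it may legitimately fall outside [0, 1]:

```python
import numpy as np

def make_windows(series, win_length):
    """Split a 1-D series into overlapping windows and scale each
    window to [0, 1] on its own min/max. Returns (X, y) where y is
    the next value after each window, scaled with the same
    per-window parameters."""
    X, y = [], []
    for start in range(len(series) - win_length):
        window = series[start:start + win_length]
        target = series[start + win_length]
        lo, hi = window.min(), window.max()
        span = hi - lo if hi > lo else 1.0  # guard against flat windows
        X.append((window - lo) / span)
        y.append((target - lo) / span)  # may exceed 1 for rising series
    return np.array(X), np.array(y)

X, y = make_windows(np.arange(100, dtype=float), win_length=20)
print(X.shape, y.shape)  # (80, 20) (80,)
```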

Communicating Results

Saving data to server/database

In [67]:
stockHist_comp.to_csv("data/results/{}_stockHist_comp.csv".format(symbol))

Web App

I created my own homepage with the help of a free Bootstrap template and published it on the web server of my personal Synology NAS, which unfortunately does not provide an easy way to work with the Apache backend server. Maybe in the future I will deploy the web app on a platform like Heroku with Flask, but for now it is more convenient to execute the stock analysis separately and only publish its results on my homepage:

http://kalinka.synology.me

Structure without Backend:

Structure for Backend Part (not implemented yet - difficulties with apache server configuration on personal NAS)

  • page for finding the symbol(s) of a stock or a company when searching for its name or sector (returns df with results/stock information)
  • page for finding general stock information when searching for its symbol
  • page that executes/updates/delivers the complete analysis for a stock with its symbol as input

Visualizations

Finance Chart
In [68]:
def plot_candlestick(df, name, window_size, save=False):
    INCREASING_COLOR = '#17BECF'
    DECREASING_COLOR = '#7F7F7F'
    
    # initial candlestick chart
    data = [ dict(
        type = 'candlestick',
        open = df.open,
        high = df.high,
        low = df.low,
        close = df.close,
        x = df.index,
        yaxis = 'y2',
        name = name,
        increasing = dict( line = dict( color = INCREASING_COLOR ) ),
        decreasing = dict( line = dict( color = DECREASING_COLOR ) ),
    ) ]

    layout=dict()

    fig = dict( data=data, layout=layout )
    
    # create the layout object
    fig['layout'] = dict()
    fig['layout']['plot_bgcolor'] = 'rgb(250, 250, 250)'
    fig['layout']['xaxis'] = dict( rangeselector = dict( visible = True ), rangeslider = dict( visible = False) )
    fig['layout']['yaxis'] = dict( domain = [0, 0.2], showticklabels = False, autorange = True, fixedrange=False )
    fig['layout']['yaxis2'] = dict( domain = [0.2, 0.8], autorange = True, fixedrange=False )
    fig['layout']['legend'] = dict( orientation = 'h', y=0.9, x=0.3, yanchor='bottom' )
    fig['layout']['margin'] = dict( t=40, b=40, r=40, l=40 )
    
    # add range buttons
    rangeselector=dict(
        visible = True,
        x = 0, y = 0.9,
        bgcolor = 'rgba(150, 200, 250, 0.4)',
        font = dict( size = 13 ),
        buttons=list([
            dict(count=1,
                 label='reset',
                 step='all'),
            dict(count=1,
                 label='1yr',
                 step='year',
                 stepmode='backward'),
            dict(count=3,
                label='3 mo',
                step='month',
                stepmode='backward'),
            dict(count=1,
                label='1 mo',
                step='month',
                stepmode='backward'),
            dict(count=7,
                label='1 w',
                step='day',
                stepmode='backward'),    
            dict(count=1,
                label='1 d',
                step='day',
                stepmode='backward'),
            dict(step='all')
        ]))

    fig['layout']['xaxis']['rangeselector'] = rangeselector
    
    # set volume bar chart colors
    colors = []

    for i in range(len(df.close)):
        if i != 0:
            if df.close[i] > df.close[i-1]:
                colors.append(INCREASING_COLOR)
            else:
                colors.append(DECREASING_COLOR)
        else:
            colors.append(DECREASING_COLOR)
    
    # calculate bollinger bands for close values (use the window_size parameter instead of a hard-coded 20)
    df[["BBlow", "BBmid", "BBup", "BBwidth", "BBperc"]] = pandas_ta.bbands(close=df.sort_values(by="date", ascending=True)["close"], length=window_size)
    
    # calculate SMA and EMA of close
    df['SMA'] = df.sort_values(by="date", ascending=True)["close"].rolling(window=window_size).mean()
    df['EMA'] = df.sort_values(by="date", ascending=True)["close"].ewm(span=window_size).mean()

    # add volume bar chart
    fig['data'].append( dict( x=df.index, y=df.volume,                         
                             marker=dict( color=colors ),
                             type='bar', yaxis='y', name='Volume' ) )

    fig['data'].append( dict( x=df.index, y=df.BBup, type='scatter', yaxis='y2', 
                             line = dict( width = 1 ),
                             marker=dict(color='#ccc'), hoverinfo='none', 
                             legendgroup='Bollinger Bands', name='Bollinger Bands') )

    fig['data'].append( dict( x=df.index, y=df.BBlow, type='scatter', yaxis='y2',
                             line = dict( width = 1 ),
                             marker=dict(color='#ccc'), hoverinfo='none',
                             legendgroup='Bollinger Bands', showlegend=False ) )
    
    fig['data'].append( dict( x=df.index, y=df.close, type='scatter', yaxis='y2',
                             line = dict( width = 2 ),
                             marker=dict(color='black'), hoverinfo='none',
                             legendgroup='Close', showlegend=True, name="Close" ) )
    
    fig['data'].append( dict( x=df.index, y=df.SMA, type='scatter', yaxis='y2',
                             line = dict( width = 1 ),
                             marker=dict(color='blue'), hoverinfo='none',
                             legendgroup='SMA', showlegend=True, name="SMA" ) )
    
    fig['data'].append( dict( x=df.index, y=df.EMA, type='scatter', yaxis='y2',
                             line = dict( width = 1 ),
                             marker=dict(color='violet'), hoverinfo='none',
                             legendgroup='EMA', showlegend=True, name="EMA" ) )
    
    # plot
    iplot( fig, filename = 'Plotly Finance Chart', validate = False )
    
    # save figure as html
    if save:
        fig = go.Figure(fig)
        fig.write_html("data/results/reports/finance_chart.html")

plot_candlestick(df=stockHist_comp, name=symbol, window_size=20, save=True)
Table with the values of the recent buy/sell/keep decision criteria
In [69]:
display(stockHist_comp[["close_adj", "volume", "SMA", "EMA", "diffCloseSMA", "BBperc", "RSI", "buySellKeepRec"]].iloc[0:1])

last_day_table = stockHist_comp[["close_adj", "volume", "SMA", "EMA", "diffCloseSMA", "BBperc", "RSI", "buySellKeepRec"]].iloc[0:1]
last_day_table = last_day_table\
    .to_html()\
    .replace('<table border="1" class="dataframe">','<table class="table table-striped">') # use bootstrap styling
            close_adj      volume    SMA    EMA  diffCloseSMA  BBperc    RSI  buySellKeepRec
date
2021-12-16      89.64  1364574.00  89.55  89.40          0.09    0.51  51.31            0.00

Report

  • create and show or download a pdf or html file containing all the dataframes and visualizations
  • run and export notebook programmatically:
    • jupyter nbconvert --execute --ExecutePreprocessor.timeout=300 --to html stock_analysis_04.ipynb
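To make the report truly automatic, the nbconvert call above could be scheduled, e.g. via cron on the NAS. The entry below is only an illustrative example (the schedule and project path are assumptions):

```shell
# hypothetical crontab entry: rebuild the html report every weekday at 18:30
30 18 * * 1-5 cd /path/to/project && jupyter nbconvert --execute --ExecutePreprocessor.timeout=300 --to html stock_analysis_04.ipynb
```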
In [70]:
html_string = '''
<html>
    <head>
        <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css">
        <style>body{ margin:0 100; background:white; }</style>
    </head>
    <body>
        <h1>Stock Analysis of "''' + symbol + '''"</h1>

        <!-- *** Section 1 *** -->
        <h2>Quick Data Overview</h2>
        
        <h3>Finance Chart</h3>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="finance_chart.html"></iframe>
        <p></p>
        
        <h3>Latest available data:</h3>
        ''' + last_day_table + '''
        <p></p>
        
        <h3>Close Prices, Dividends and Splits</h3>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="close_plot.html"></iframe>
        <p></p>
        
        <!-- *** Section 2 *** -->
        <h2>Technical Analysis</h2>
        
        <h3>Bollinger Bands, Moving Averages and RSI</h3>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="bb_rsi_plot.html"></iframe>
        <p></p>
        
        <h3>Distance to the mean</h3>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="diffCloseSMA_plot.html"></iframe>
        <p></p>
        
        <h3>1-Day Returns in [%]</h3>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="1dayReturn_perc_plot.html"></iframe>
        <p></p>

        <h3>Seasonal Decomposition</h3>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="seasonal_decomposition_plot.html"></iframe>
        <p></p>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="seasonal_plot.html"></iframe>
        <p></p>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="residual_plot.html"></iframe>
        <p></p>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="seasonal_lag_plot.html"></iframe>
        <p></p>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="seasonal_predict_plot.html"></iframe>
        <p></p>
        
        <h3>Trends</h3>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="trend_weak_plot.html"></iframe>
        <p></p>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="trend_strong_plot.html"></iframe>
        <p></p>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="buySellKeepRec_plot.html"></iframe>
        <p></p>
        
        <!-- *** Section 3 *** -->
        <h2>Predictions</h2>
        
        <h3>Adjusted Close Value on the next day</h3>
        <p>Disclaimer: The displayed chart is used as place holder and does <u>not</u> predict future stock values in any way!</p>
        <iframe width="1000" height="550" frameborder="0" seamless="seamless" scrolling="no" src="backtest_plot.html"></iframe>
        <p></p>
    </body>
</html>'''
In [71]:
with open('data/results/reports/report.html', 'w') as f:
    f.write(html_string)

Send Alert/Info Message via Mail

I wrote a small function that sends mails programmatically (with the help of a Google app password). Once finished, it shall be used to inform or alert me when my server detects a strong buy or sell recommendation (according to my definitions above).

Use Case:

  • several criteria indicate a strong down (sell) or up (buy) trend
    • condition example: Buy, because technical indicators are "positive" or promising and predicted return is positive
  • mail subject: affected stocks
  • message: "possible up/down trend for stock xy detected, check web app for further analysis..."
In [72]:
def send_message(subject, body, to):
    """
    input:
        subject (str): String describing the subject of the message
        body (str): String containing the message text
        to (str): String containing the mail address
    output:
        E-Mail
    """
    with open('api_keys/google.txt') as f:
        google_app_password = f.readlines()[0]
    
    msg = EmailMessage()
    msg.set_content(body)
    msg["subject"] = subject
    msg["to"] = to
    
    user = "thomaskallnik@gmail.com"
    msg["from"] = user
    password = google_app_password  # generated google app password
    
    server = smtplib.SMTP("smtp.gmail.com", 587)
    server.starttls()
    server.login(user, password)
    server.send_message(msg)
    
    server.quit()
    
# send_message("Error", "Why can't i send it as an SMS???", "thomaskallnik@gmail.com")

Conclusion

Even though my work on the project is not finished yet, this will be it for the first blog post. There is still a lot to improve and not all goals have been achieved, but here is what I learned regarding each requirement from the outline at the beginning:

  • create a pipeline gathering stock data, calculating features and saving on personal NAS
    • it's not a pipeline yet, but I prepared all functions and steps necessary for an automated pipeline
  • use machine learning to estimate performance, trend and eventually prices
    • I learned a lot about the different methods and models available for these tasks and now know where to dive deeper in the future (especially understanding the single layers)
    • a basic LSTM recurrent neural network has been implemented to "predict" a stock feature one step into the future, which is definitely not helping a lot given the API's data up-to-dateness
    • possible improvements on the model/algorithm have been mentioned
  • create automated daily report with visualizations to be aware of changes in stocks
    • a variety of plotly diagrams and pandas data frames were created to understand the stock development and are summarized in an html report that could be automatically produced
  • send report and/or special info/warnings automated via mail
    • the code snippet for the information/warning mail is ready to use - the criteria for a mail and its content, however, still need to be defined
  • publish newest report and non-confidential insights on personal web app (create with Flask/Bootstrap and deploy on personal NAS Server)
    • the front end of my web app is ready and published on my NAS web server including the project notebook and report
    • for the use of a backend I will either need to understand the Apache configuration on my NAS or deploy the web app on a Heroku server with a different domain
  • create Python module for standard plots and calculations
    • for an easier review of the project, I left all functions in the notebook
    • nevertheless I used the knowledge from Udacity's course to create my own python modules for other projects
  • use git for version control on remote repository on personal NAS (and share on github)
    • I created and used a git bare remote repository on my NAS as backup
    • in addition, the github repository is available for the review
  • write and publish an article about the project (homepage and or medium)
    • the notebook itself is structured and styled as a blogpost so that no extra document is needed
    • additionally, an html report briefly shows all relevant figures and results

Future Prospects

During the project I continuously developed more ideas about what I could do in the future to improve the functionality and the accuracy of the data. Here are some of them:

    - Portfolio Completion
      - include algorithm that iterates through stocks in a specific field/industry and identifies the ones with promising performance or lower price than value
    - MVO (Mean Variance Optimization): Design Portfolio to reach target return with minimal risk (use covariance e.g. negative correlation between stocks)
      - identify the efficient frontier (best portfolios (weights) in terms of return/risk)
    - Try other machine learning models and combine their results (inspect TFTs)
    - Estimate Confidence to evaluate predictions
    - Use Sentiment Analysis to estimate impacts on the stock markets that are hard to find with technical analysis
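For the MVO idea, the minimum-variance end of the efficient frontier has a closed form that could serve as a starting point: w = Σ⁻¹1 / (1ᵀΣ⁻¹1), with Σ the covariance matrix of the asset returns. A sketch with toy covariance numbers (no target return constraint yet):

```python
import numpy as np

# toy covariance matrix for three assets (illustrative numbers only)
cov = np.array([[0.040,  0.006,  0.000],
                [0.006,  0.090, -0.012],
                [0.000, -0.012,  0.160]])

# closed-form minimum-variance weights: w = inv(cov) @ 1 / (1 @ inv(cov) @ 1)
ones = np.ones(len(cov))
inv = np.linalg.inv(cov)
weights = inv @ ones / (ones @ inv @ ones)

print(weights, weights.sum())  # weights sum to 1 by construction
```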

Questions to the Reviewer

Project:

  1. Are there state-of-the-art stock prediction and/or timeseries forecast methods/strategies/models/classifiers that I haven't mentioned in this project and could look into?
  2. Is there a recommended (free) solution for SMS/WhatsApp messaging with Python?
  3. Is the common use case to train a model as well as possible, save it, and then load and use it every day to predict the next days? Or is the goal to retrain the model from scratch each day with the newest data?
    • I guess it's going to be a compromise: saving weights, using the model to predict for a few days, and then improving it by training on new data after some time?!
  4. In the Keras documentation I found the "WindowGenerator"... Is this function now deprecated and equal to Keras' "TimeseriesGenerator", or why can't I find more information on WindowGenerator anywhere? Why wouldn't they update their tutorial? https://www.tensorflow.org/tutorials/structured_data/time_series
  5. I did not understand the batch_size parameter of the TimeseriesGenerator. The number of timeseries windows generated is len(timeseriesgeneratorobject) and the window length (timesteps per window) is the parameter "length", so what is the meaning of the batch_size? Apparently increasing the batch_size shortens the training time a lot and reduces the number of windows...
  6. When I predict on the test_generator object, I receive an array of shape (number of windows, window length, 1) even though I thought the model would only predict one single value per window (not win_length predictions). What am I doing wrong?
  7. How could I implement my personal definition of error in the model training process above?
    • (instead of model.compile(loss=tf.losses.MeanSquaredError(), optimizer=tf.optimizers.Adam(), metrics=[tf.metrics.MeanAbsoluteError()]))
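Regarding question 5, my current understanding of the batch_size arithmetic is the following (it matches what I observed, so the question is mainly whether this reading is correct): a series of n samples yields n - length sliding windows, and len(generator) counts batches of windows, not individual windows.

```python
import math

# illustrative numbers; only the arithmetic matters here
n, length, batch_size = 1000, 20, 32

num_windows = n - length                          # sliding windows in total
num_batches = math.ceil(num_windows / batch_size) # what len(generator) returns
print(num_windows, num_batches)  # 980 31
```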

General:

  1. Is there any way I can get some statistics about my studying on the Udacity platform this year?
    • For example: Time spent on the platform across all courses (and separately for each course).

Merry Xmas, thanks in advance for the review and sorry for the long texts!

Sources & Credits